What is Spark?¶
An in-memory, distributed data processing framework/engine.
Spark Features:¶
- In memory computation
- Open source
- Cost effective
- Fault-tolerance
- Supports multiple languages (Python, Scala, Java, and R)
- Lazy Evaluation
- Batch/Near real-time data streaming
- Very rich built-in library support
- Concise APIs: logic that takes 20–25 lines of Java MapReduce code can often be expressed in a single line of Python/Scala
- Up to 100 times faster in memory than Hadoop MapReduce or other traditional disk-based systems (per Spark's early benchmarks)
- Up to 10 times faster on disk
- Spark can integrate with Hadoop ecosystem, ETL tools like Talend, Informatica, etc., and cloud platforms (AWS, Azure, GCP)
On-Heap vs Off-Heap Memory¶
| Aspect | On-Heap Memory | Off-Heap Memory |
|---|---|---|
| Location | Inside JVM Heap | Outside JVM Heap (native memory) |
| Management | JVM Garbage Collector | Spark (Tungsten engine) |
| Default | Yes | No (needs config) |
| Performance | Slower for large data (GC overhead) | Faster for large data (no GC) |
| Storage Format | Java objects | Serialized binary |
| Use Cases | Small to medium datasets, default Spark jobs | Large datasets, Spark SQL/DataFrame, caching, joins, shuffles |
| Risk | GC pauses, OutOfMemoryError | Native memory leaks if misused |
| Configuration | No extra setup | spark.memory.offHeap.enabled=true and spark.memory.offHeap.size |
| Example | Default caching with .cache() | Tungsten-optimized, serialized off-heap storage |
Note: reduceByKey is a wide transformation that involves shuffling data across partitions, so it introduces a new stage (step) in the job's execution plan.